Day 1 Simple and Multiple Regression

Section 1: Simple Regression

Libraries

library(Statamarkdown) # This is only necessary for me
## Stata found at C:/Program Files/Stata16/StataMP-64.exe
## Warning in readLines("profile.do"): incomplete final line found on 'profile.do'
## Found an existing 'profile.do'
##    cd C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regression 
##     
##    use flats.dta
## The 'stata' engine is ready to use.
library(ggplot2) # Necessary for plotting data
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(pprint)

Loading The Dataset

R

  1. Before we do anything we need to open our dataset
  2. I have saved an .RDS file for R-users.
  3. .RDS files store a single R-object
  4. This can also be accessed directly by loading the astatur package and calling the object “flats”
# Changing working directory 
setwd("C:/Users/slupp/OneDrive/Skrivebord/NTNU/Mehmet/PSY8003/Simple_and_multiple_regression") 

# Reading the dataframe and storing it in an object named df
df <- readRDS("flats.rds") 

# Viewing the dataframe
View(df)

STATA

  1. I have stored the same dataset as a .dta file
// setting working directory
cd C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regression

// Loading the dataset 
use flats.dta

// Viewing the dataset
browse
C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Simple_and_multiple_regr
> ession


request ignored because of batch mode

Getting an Overview of the Dataset

R

# Base R:
summary(df)
   flat_price        location    floor_size       year_built  
 Min.   : 225000   centre:34   Min.   : 20.00   Min.   :1871  
 1st Qu.: 398333   south : 9   1st Qu.: 58.00   1st Qu.:1998  
 Median : 498333   west  :11   Median : 74.00   Median :2006  
 Mean   : 529383   east  :41   Mean   : 77.37   Mean   :2001  
 3rd Qu.: 588750               3rd Qu.: 88.00   3rd Qu.:2010  
 Max.   :1816667               Max.   :228.00   Max.   :2013  
                                                NA's   :16    
 energy_efficiency
 best    :10      
 mediocre:48      
 poor    :24      
 NA's    :13      
                  
                  
                  
str(df)
tibble [95 × 5] (S3: tbl_df/tbl/data.frame)
 $ flat_price       : num [1:95] 664167 225000 453333 446667 331667 ...
 $ location         : Factor w/ 4 levels "centre","south",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ floor_size       : num [1:95] 88 20 84 58 46 37 57 129 152 51 ...
  ..- attr(*, "label")= chr "størrelsen på leiligheten"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ year_built       : num [1:95] 1871 2005 1922 NA 2010 ...
  ..- attr(*, "label")= chr "året leiligheten er byggd"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ energy_efficiency: Factor w/ 3 levels "best","mediocre",..: NA 2 3 3 3 2 2 2 3 2 ...
# If you have installed the pprint package 
psummary(df)
# Color data frame (class colorDF) 8 x 5:
 │Col              │Class│NAs  │unique│Mean    │SD      │Min   │Max    
1│flat_price       │<dbl>│    0│    55│529382.5│222449.6│225000│1816667
2│location         │<fct>│    0│     4│      NA│      NA│    NA│     NA
3│floor_size       │<dbl>│    0│    56│    77.4│    32.6│    20│    228
4│year_built       │<dbl>│   16│    26│  2000.9│    19.4│  1871│   2013
5│energy_efficiency│<fct>│   13│     3│      NA│      NA│    NA│     NA

STATA

codebook
sum 
flat_price                                                          (unlabeled)
-------------------------------------------------------------------------------

                  type:  numeric (double)

                 range:  [225000,1816666.6]           units:  .01
         unique values:  55                       missing .:  0/95

                  mean:    529382
              std. dev:    222450

           percentiles:        10%       25%       50%       75%       90%
                            325000    398333    498333    595833    798333

-------------------------------------------------------------------------------
location                                                            (unlabeled)
-------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  location

                 range:  [1,4]                        units:  1
         unique values:  4                        missing .:  0/95

            tabulation:  Freq.   Numeric  Label
                            34         1  centre
                             9         2  south
                            11         3  west
                            41         4  east

-------------------------------------------------------------------------------
floor_size                                            størrelsen på leiligheten
-------------------------------------------------------------------------------

                  type:  numeric (double)

                 range:  [20,228]                     units:  1
         unique values:  56                       missing .:  0/95

                  mean:   77.3684
              std. dev:   32.6438

           percentiles:        10%       25%       50%       75%       90%
                                46        58        74        88       113

-------------------------------------------------------------------------------
year_built                                            året leiligheten er byggd
-------------------------------------------------------------------------------

                  type:  numeric (double)

                 range:  [1871,2013]                  units:  1
         unique values:  26                       missing .:  16/95

                  mean:   2000.89
              std. dev:   19.3934

           percentiles:        10%       25%       50%       75%       90%
                              1993      1997      2006      2010      2012

-------------------------------------------------------------------------------
energy_efficiency                                                   (unlabeled)
-------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  energy_efficiency

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  13/95

            tabulation:  Freq.   Numeric  Label
                            10         1  best
                            48         2  mediocre
                            24         3  poor
                            13         .  

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  flat_price |         95    529382.5    222449.6     225000    1816667
    location |         95    2.621053    1.354103          1          4
  floor_size |         95    77.36842    32.64381         20        228
  year_built |         79    2000.886    19.39336       1871       2013
energy_eff~y |         82    2.170732     .624695          1          3

Running a simple regression

R

  1. Regressions are done using the lm() function
  2. The syntax is:
# lm(formula, data, subset, weights, na.action,
#    method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
#    singular.ok = TRUE, contrasts = NULL, offset, ...)
  1. The most important arguments are formula and data

  2. The formula argument is derived from the mathemeatical specification of your model where “=” is substituded with “~”

    \(\hat{Y}_i \sim \hat{\beta_0} + \hat{\beta_1}X_i\)

  1. Since we haven’t estimated \(\hat{\beta}_0 + \hat{\beta}_1\) , we leave them out of the equation

  2. We’re then left with: \(\hat{Y}_i \sim X_i\)

  3. The similarity between the mathematical formula, and the formula used in the lm() function will become more apparent when we look at more complex models (e.g., multiple- and interaction regression)

  4. Lets say we want to see how well floor size predicts flat price

# Using named arguments and storing the model as an object 
reg_price_size <- lm(formula = flat_price ~ floor_size, 
                     data = df) 

# Using positional arguments 
reg_price_size <- lm(flat_price ~ floor_size, df)

# A little tip: 
  # When calling the functions this way R cannot autocomplete for you
  # But if we use the pipe operator from the margarittr/dplyr package 
  # We can get our variable_names autocompleted

reg_price_size <- df %>% lm(data =, #%>% pipes to the first argument
                            formula = flat_price ~ floor_size)

reg_price_size

Call:
lm(formula = flat_price ~ floor_size, data = .)

Coefficients:
(Intercept)   floor_size  
     121356         5274  
  1. Note that we do not get a lot of information by just calling the model it self
  2. To fix this we can either us the base function summary() or the preg() function from pprint
summary(reg_price_size)

Call:
lm(formula = flat_price ~ floor_size, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-234582  -70624   -8154   42566 1014989 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 121356.1    37547.9   3.232   0.0017 ** 
floor_size    5273.8      447.5  11.785   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141600 on 93 degrees of freedom
Multiple R-squared:  0.5989,    Adjusted R-squared:  0.5946 
F-statistic: 138.9 on 1 and 93 DF,  p-value: < 2.2e-16
preg(reg_price_size)
 
 
 
 Model fit: flat_price 
# Color data frame (class colorDF) 6 x 3:
        │SS      │df   │M.      │M     │R            │R.      
   Model│2.79e+12│    1│F(1, 93)│138.89│    R-squared│5.99e-01
Residual│1.87e+12│   93│prob > F│ 4e-20│adj R-squared│5.95e-01
   Total│4.65e+12│   94│       N│    95│          MSE│1.42e+05

 
 Coefficients: flat_price 
# Color data frame (class colorDF) 6 x 2:
           │ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│       121356│     37548│    3.23│   0.002│    46793│    195919
 floor_size│         5274│       447│   11.79│  <2e-16│     4385│      6162

STATA

  1. In STATA it is a bit simpler
  2. There we can use the regress command (abbreviation: reg)
  3. Syntax: regress depvar [indepvars] [if] [in] [weight] [, options]
  4. Since each STATA-session only uses a single dataset we do not need to specify the dataset
  5. Further the reg command automatically summarizes the model
reg flat_price floor_size
      Source |       SS           df       MS      Number of obs   =        95
-------------+----------------------------------   F(1, 93)        =    138.89
       Model |  2.7860e+12         1  2.7860e+12   Prob > F        =    0.0000
    Residual |  1.8655e+12        93  2.0059e+10   R-squared       =    0.5989
-------------+----------------------------------   Adj R-squared   =    0.5946
       Total |  4.6515e+12        94  4.9484e+10   Root MSE        =    1.4e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   5273.811   447.4979    11.79   0.000     4385.169    6162.453
       _cons |   121356.1   37547.91     3.23   0.002     46793.35    195918.8
------------------------------------------------------------------------------

Interpretation

Does anyone want to interpret the results?

cat(interpretation_reg_price_size)
Error in eval(expr, envir, enclos): object 'interpretation_reg_price_size' not found

Plotting a Simple Regression

R

  1. We can use base-functions to make both 2D and 3D plots, but for plotting 2D graphs i prefer using ggplot(2)
  2. We use the ggplot function
  3. When using ggplot we must supply our dataframe
  4. ggplot works by layering graphics onto each other.
  5. Each ggplot builds upon a mapping, which most often is called by aes(x = x-axis, y = y-axis)
  6. We add layers on top using the + operator
  7. Here we use geom_smooth() to add a function summarizing our data
library(ggplot2)

ggplot(data = df,
       mapping = aes(x = floor_size,
                     y = flat_price)) +
  geom_smooth(method = "lm", se = F) #se = F removes error estimates
`geom_smooth()` using formula = 'y ~ x'

We can improve the plot by adding participant values using geom_point()

ggplot(data = df,
       mapping = aes(x = floor_size,
                     y = flat_price)) +
  geom_smooth(method = "lm", se = F) +
  geom_point()
`geom_smooth()` using formula = 'y ~ x'

We can also add the baseline model (i.e., the mean) using hline() (horizontal line)

ggplot(data = df,
       mapping = aes(x = floor_size,
                     y = flat_price)) +
  geom_smooth(method = "lm", se = F) +
  geom_point() +
  geom_hline(yintercept = mean(df$flat_price), 
             color = "red",
             linewidth = 1) 
`geom_smooth()` using formula = 'y ~ x'

We can customize labels using the labs() argument

ggplot(data = df,
       mapping = aes(x = floor_size,
                     y = flat_price)) +
  geom_smooth(method = "lm", se = F) +
  geom_point() +
  geom_hline(yintercept = mean(df$flat_price),
             color = "red",
             linewidth = 1) +
  labs(x = "Flat Price $", y = "Floor Size sqm")
`geom_smooth()` using formula = 'y ~ x'

STATA

  1. We can achieve the same in STATA using the twoway command.
  2. The twoway command builds upon the plot command, but allows layering of different plots
  3. It is a bit more restricted than ggplot, so i haven’t found a very elegant (while being easy) solution

Basic Regression Plot

twoway lfit flat_price floor_size

Adding data points

twoway (lfit flat_price floor_size) (scatter flat_price floor_size) 

Adding a mean line

// Creating function which is mean of flatprice
egen mean_var = mean(flat_price) 

// Plot
twoway (lfit flat_price floor_size) (scatter flat_price floor_size) (lfit mean_var floor_size)

Exercises

  1. Run a regression between flat price and year built
  2. Interpret the model as a whole, and the coefficients
  3. Create a graph representing your model

Section 2: Multiple Regression

R

  1. We use the exact same function for a multiple regression, as a simple regression
  2. As in the mathematical formula, we simply add a variable using the + operator
  3. $Y_i \sim \beta_1 X_{1i} + \beta_2X_{2i} ... \beta_kX_{ki} $
reg_price_size_year <- df %>% lm(data =, 
                                 formula = flat_price ~ floor_size + year_built)
summary(reg_price_size_year)

Call:
lm(formula = flat_price ~ floor_size + year_built, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-209201  -68960  -17571   55959 1001158 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1153762.9  1762246.8  -0.655    0.515    
floor_size      5367.4      506.2  10.604   <2e-16 ***
year_built       640.0      878.9   0.728    0.469    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 150100 on 76 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared:  0.5967,    Adjusted R-squared:  0.5861 
F-statistic: 56.22 on 2 and 76 DF,  p-value: 1.032e-15
preg(reg_price_size_year, std_beta = T)
 
 
 
 Model fit: flat_price 
# Color data frame (class colorDF) 6 x 3:
        │SS      │df   │M.      │M    │R            │R.      
   Model│2.53e+12│    2│F(2, 76)│56.22│    R-squared│5.97e-01
Residual│1.71e+12│   76│prob > F│1e-15│adj R-squared│5.86e-01
   Total│4.25e+12│   78│       N│   79│          MSE│1.50e+05

 
 Coefficients: flat_price 
# Color data frame (class colorDF) 7 x 3:
           │ Coefficients│std.beta│ std.error│ t-value│ p-value│ CI[2.5%]
(Intercept)│     -1153763│   0.000│   1762247│  -0.655│   0.515│ -4663582
 floor_size│         5367│   0.775│       506│  10.604│  <2e-16│     4359
 year_built│          640│   0.053│       879│   0.728│   0.469│    -1111
           │ CI[97.5%]
(Intercept)│   2356056
 floor_size│      6376
 year_built│      2391

STATA

reg flat_price floor_size year_built
      Source |       SS           df       MS      Number of obs   =        79
-------------+----------------------------------   F(2, 76)        =     56.22
       Model |  2.5332e+12         2  1.2666e+12   Prob > F        =    0.0000
    Residual |  1.7121e+12        76  2.2528e+10   R-squared       =    0.5967
-------------+----------------------------------   Adj R-squared   =    0.5861
       Total |  4.2453e+12        78  5.4427e+10   Root MSE        =    1.5e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   5367.429    506.182    10.60   0.000      4359.28    6375.577
  year_built |   640.0368   878.9464     0.73   0.469    -1110.537     2390.61
       _cons |   -1153763    1762247    -0.65   0.515     -4663582     2356056
------------------------------------------------------------------------------

Anybody want to intepret the results?

cat(interpretation_reg_price_size_year)
Error in eval(expr, envir, enclos): object 'interpretation_reg_price_size_year' not found

Plotting Multiple Regressions

  1. When we have a multiple regression we can no longer represent it as is, in two dimensions.
  2. If it has only two predictors, we can represent it as a three dimensional plot
# install.packages("plot3D")
library(plot3D)

grid_size <- 30

# NA's cause problems since it affects the surface and scatter object differently
df2 <- na.omit(df)

reg_price_size_year <- df2 %>% lm(data =, 
                                 formula = flat_price ~ floor_size + year_built)

# Creating a grid representing our regression plane
seq_floor_size <- seq(min(df2$floor_size, na.rm = T),
                      max(df2$floor_size, na.rm = T),
                      length.out = grid_size)

seq_year_built <- seq(min(df2$year_built, na.rm = T),
                      max(df2$year_built, na.rm = T),
                      length.out = grid_size)

year_size <- expand.grid(floor_size = seq_floor_size,
                   year_built = seq_year_built)

predicted_price <- predict(reg_price_size_year, 
                           newdata = year_size) |>
  matrix(ncol = grid_size, nrow = grid_size)



fitpoints <- predict(reg_price_size_year)

scatter3D(x = df2$floor_size,
          y = df2$year_built,
          z = df2$flat_price,
          # Regression plane
          surf = list(x = seq_floor_size,
                      y = seq_year_built,
                      z = predicted_price,
                      # Residuals
                      fit = fitpoints,
                      # Surface lines
                      facets = NA, 
                      col = "black"),
          # Plot Settings
          pch = 18, 
          ticktype = "detailed",
          bty = "g",
          col = "black",
            # Viewpoint 
          theta = 20, 
          phi = 15)

# install.packages("plot3D")
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
grid_size <- 30

# Creating a grid representing our regression plane
seq_floor_size <- seq(min(df2$floor_size, na.rm = T),
                      max(df2$floor_size, na.rm = T),
                      length.out = grid_size)

seq_year_built <- seq(min(df2$year_built, na.rm = T),
                      max(df2$year_built, na.rm = T),
                      length.out = grid_size)

year_size <- expand.grid(floor_size = seq_floor_size,
                   year_built = seq_year_built)

predicted_price <- matrix(predict(reg_price_size_year, 
                                  newdata = year_size),
                          ncol = grid_size, 
                          nrow = grid_size, 
                          byrow = TRUE) 

xyz <- cbind(seq_floor_size, 
             seq_year_built, 
             predict(reg_price_size_year, newdata = year_size))



fitpoints <- predict(reg_price_size_year)

plot_ly(data = df2,
        type = "scatter3d",
        #mode = "markers",
        x = ~floor_size,
        y = ~year_built,
        z = ~flat_price,
        color = ~location,
        size = I(150)) %>%
  add_surface(x = seq_floor_size,
              y = seq_year_built,
              z = predicted_price,
              opacity = .3)
No scatter3d mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

A simpler version:

#install.packages("margins")
library(margins)
reg_price_size_year <- lm(flat_price ~ floor_size + year_built, data = df) #this must be called this way, for the persp function to work

persp(reg_price_size_year, 
      xvar = "floor_size", 
      yvar = "year_built",
      theta = -30,
      phi = 15)

It is however easier using 2D graphs

library(margins)

# Making a dataframe with predictions at different combinations of values
margins_regression <- margins(reg_price_size_year,
                              at = list(floor_size = c(50,
                                                       100,
                                                       150,
                                                       200),
                                        year_built = c(1970,
                                                       1990,
                                                       2000,
                                                       2010)))
# Converting year_built predictions to a factor, so it can be used as a categorical variable in ggplot 
margins_regression <- mutate(margins_regression, 
                             year_built = factor(year_built,
                                                    levels = c(1970,
                                                               1990,
                                                               2000,
                                                               2010),
                                                    labels = c("< 1970",
                                                               "< 1990",
                                                               "< 2000",
                                                               "< 2013")))

# Creating a categorical variable for year_built for observed values, so it can be used as a categorical variable in scatterplot
df_scatter <- df %>% mutate(year_built = case_when(year_built <= 1970 ~ "< 1970",
                                                   year_built <= 1990 ~ "< 1990",
                                                   year_built <= 2000 ~ "< 2000",
                                                   year_built <= 2013 ~ "< 2013"))



ggplot(margins_regression,
       mapping = aes(x = floor_size, 
                     y = fitted, 
                     colour = year_built)) + 
  geom_smooth(method = "lm", se = F, fullrange = TRUE) +
  geom_point(df_scatter, 
             mapping = aes(x = floor_size,
                          y = flat_price))
`geom_smooth()` using formula = 'y ~ x'

# Making a dataframe with predictions at different combinations of values
margins_regression <- margins(reg_price_size_year,
                              at = list(floor_size = c(50,
                                                       100,
                                                       150,
                                                       200),
                                        year_built = c(1940,
                                                       1990,
                                                       2000,
                                                       2010)))
# Converting floor_size predictions to a factor, so it can be used as a categorical variable in ggplot 
margins_regression <- mutate(margins_regression, 
                             floor_size = factor(floor_size,
                                                    levels = c(50,
                                                               100,
                                                               150,
                                                               200),
                                                    labels = c("< 50",
                                                               "< 100",
                                                               "< 150",
                                                               "> 150")))
# Creating a categorical variable for floor_size for observed values, so it can be used as a categorical variable in scatterplot
df_scatter <- df %>% mutate(floor_size = case_when(floor_size <= 50 ~ "< 50",
                                                   floor_size <= 100 ~"< 100",
                                                   floor_size <= 150 ~ "< 150",
                                                   floor_size > 150 ~ "> 150"))
ggplot(margins_regression,
       mapping = aes(x = year_built, 
                     y = fitted, 
                     colour = floor_size)) + 
  geom_smooth(method = "lm", se = F, fullrange = TRUE) +
  # adding scatterplot (using) observed values
  geom_point(df_scatter, 
             mapping = aes(x = year_built,
                          y = flat_price))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 16 rows containing missing values (`geom_point()`).

STATA

reg flat_price floor_size year_built

// X-axis = year_built
margins, at(year_built = (1930(50)2010) floor_size = (60(50)220))
marginsplot

// X-axis = floor_size 
margins, at(floor_size = (60(50)220) year_built = (1930(30)2010))
marginsplot
      Source |       SS           df       MS      Number of obs   =        79
-------------+----------------------------------   F(2, 76)        =     56.22
       Model |  2.5332e+12         2  1.2666e+12   Prob > F        =    0.0000
    Residual |  1.7121e+12        76  2.2528e+10   R-squared       =    0.5967
-------------+----------------------------------   Adj R-squared   =    0.5861
       Total |  4.2453e+12        78  5.4427e+10   Root MSE        =    1.5e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   5367.429    506.182    10.60   0.000      4359.28    6375.577
  year_built |   640.0368   878.9464     0.73   0.469    -1110.537     2390.61
       _cons |   -1153763    1762247    -0.65   0.515     -4663582     2356056
------------------------------------------------------------------------------


Adjusted predictions                            Number of obs     =         79
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : floor_size      =          60
               year_built      =        1930

2._at        : floor_size      =          60
               year_built      =        1980

3._at        : floor_size      =         110
               year_built      =        1930

4._at        : floor_size      =         110
               year_built      =        1980

5._at        : floor_size      =         160
               year_built      =        1930

6._at        : floor_size      =         160
               year_built      =        1980

7._at        : floor_size      =         210
               year_built      =        1930

8._at        : floor_size      =         210
               year_built      =        1980

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   403553.8   65859.33     6.13   0.000     272383.5      534724
          2  |   435555.6   27030.97    16.11   0.000     381718.8    489392.4
          3  |   671925.2   65373.07    10.28   0.000     541723.4    802126.9
          4  |     703927   28966.64    24.30   0.000       646235    761619.1
          5  |   940296.6   74100.79    12.69   0.000     792712.1     1087881
          6  |   972298.5   47207.63    20.60   0.000     878276.3     1066321
          7  |    1208668   89382.98    13.52   0.000      1030646     1386690
          8  |    1240670   69993.97    17.73   0.000      1101265     1380075
------------------------------------------------------------------------------


  Variables that uniquely identify margins: year_built floor_size


Adjusted predictions                            Number of obs     =         79
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : floor_size      =          60
               year_built      =        1930

2._at        : floor_size      =          60
               year_built      =        1960

3._at        : floor_size      =          60
               year_built      =        1990

4._at        : floor_size      =         110
               year_built      =        1930

5._at        : floor_size      =         110
               year_built      =        1960

6._at        : floor_size      =         110
               year_built      =        1990

7._at        : floor_size      =         160
               year_built      =        1930

8._at        : floor_size      =         160
               year_built      =        1960

9._at        : floor_size      =         160
               year_built      =        1990

10._at       : floor_size      =         210
               year_built      =        1930

11._at       : floor_size      =         210
               year_built      =        1960

12._at       : floor_size      =         210
               year_built      =        1990

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   403553.8   65859.33     6.13   0.000     272383.5      534724
          2  |   422754.9   41350.41    10.22   0.000     340398.4    505111.3
          3  |     441956   21745.02    20.32   0.000       398647    485264.9
          4  |   671925.2   65373.07    10.28   0.000     541723.4    802126.9
          5  |   691126.3   41825.44    16.52   0.000     607823.7    774428.9
          6  |   710327.4    24812.9    28.63   0.000     660908.2    759746.6
          7  |   940296.6   74100.79    12.69   0.000     792712.1     1087881
          8  |   959497.7   55407.39    17.32   0.000     849144.3     1069851
          9  |   978698.8    45162.1    21.67   0.000     888750.7     1068647
         10  |    1208668   89382.98    13.52   0.000      1030646     1386690
         11  |    1227869   75310.63    16.30   0.000      1077875     1377863
         12  |    1247070   68881.43    18.10   0.000      1109881     1384260
------------------------------------------------------------------------------


  Variables that uniquely identify margins: floor_size year_built

Exercises

  1. Perform a multiple regression with flat_price as a dependent variable where independent variables are floor_size and energy_efficiency
    1. NB: Note that energy_efficiency is coded as a nominal variable. In this exercise i want you to treat it as a continuous variable, so it must be recoded as such.
    2. I encourage you (Regardless of which program you use) to recode the variable on your own. If you struggle to do so, i have included a solution under [Appendix: Recoding energy_efficiency]
  2. Interpret the results
  3. Visualize your model in a plot (you decide what plot you want to use)
  4. Perform a multiple regression with flat_price as a dependent variable where independent variables are floor_size year_built and energy_efficiency
  5. Interpret the results
  6. Try to visualize this model in a twodimensional plot using the margins commands in STATA, or the margins package in R (NB this is a bit harder) I have included a (suggested) solution Appendix: Solution Exercise 6

Appendix

Appendix: Using ggeffects to visualize regressions.

Depending on the circumstances it can be a bit easier to use the ggeffects package, compared to the margins package, it is especially good for visualizing interaction effects.

#install.packages("ggeffects")
library(ggeffects)

prediction <- ggpredict(reg_price_size_year, 
                        terms = c("year_built", "floor_size"))

df <- df %>% mutate(group = case_when(floor_size <= 46.8 ~ "46.8", 
                                      floor_size <= 77.4 ~ "77.4",
                                      floor_size <= 108 ~ "108"))
                                   

prediction %>% ggplot(mapping = aes(x = x,
                                    y = predicted,
                                    colour = group)) +
  geom_smooth(method = "lm", 
              formula = y ~ x, 
              se = F) +
  # This adds a scatter plot, but is not necessary for visualizing the model
  geom_point(data = df, 
             mapping = aes(x = year_built, 
                           y = flat_price,
                           colour = group), inherit.aes = FALSE) +
  ylab("Flat Price USD") +
  xlab("Year Built")
Warning: Removed 16 rows containing missing values (`geom_point()`).

Appendix: Recoding energy_efficiency

R

df <- df %>% 
  mutate(i.energy_efficiency = case_when(energy_efficiency == "poor" ~ 1,
                                             energy_efficiency == "mediocre" ~ 2,
                                             energy_efficiency == "best" ~ 3))
# NB we could also use as.integer(energy_efficency), but this would cause the variable to be coded in reverse order (i.e., best = 1, and poor = 3)

STATA


// When recoding the variable this way, labels are removed and it becomes numeric. Note also that the variable is coded in reverse order

recode energy_efficiency (1 = 3) (2 = 2) (3 = 1)

reg(flat_price floor_size energy_efficiency)
(energy_efficiency: 34 changes made)

      Source |       SS           df       MS      Number of obs   =        82
-------------+----------------------------------   F(2, 79)        =     64.10
       Model |  2.5440e+12         2  1.2720e+12   Prob > F        =    0.0000
    Residual |  1.5677e+12        79  1.9844e+10   R-squared       =    0.6187
-------------+----------------------------------   Adj R-squared   =    0.6091
       Total |  4.1116e+12        81  5.0761e+10   Root MSE        =    1.4e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   5747.269   523.2311    10.98   0.000     4705.803    6788.734
energy_eff~y |   80118.73   25076.54     3.19   0.002     30205.12    130032.3
       _cons |  -64946.35   64265.62    -1.01   0.315    -192863.9    62971.16
------------------------------------------------------------------------------
// As with R there is a simpler soloution where we simple use c. (for continous) at the start of our variable when calling our regression. However the variable would come in reversed order. 

reg(flat_price floor_size c.energy_efficiency)
> us) at the start of our variable when calling our regression. However the var
> iable would come in reversed order. 

      Source |       SS           df       MS      Number of obs   =        82
-------------+----------------------------------   F(2, 79)        =     64.10
       Model |  2.5440e+12         2  1.2720e+12   Prob > F        =    0.0000
    Residual |  1.5677e+12        79  1.9844e+10   R-squared       =    0.6187
-------------+----------------------------------   Adj R-squared   =    0.6091
       Total |  4.1116e+12        81  5.0761e+10   Root MSE        =    1.4e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   5747.269   523.2311    10.98   0.000     4705.803    6788.734
energy_eff~y |  -80118.73   25076.54    -3.19   0.002    -130032.3   -30205.12
       _cons |   255528.6   68240.54     3.74   0.000     119699.2      391358
------------------------------------------------------------------------------

Appendix: Solution Exercise 6

R

reg_price_size_energy_year <- lm(flat_price ~ floor_size + energy_efficiency + year_built, data = df)

margins_regression <- margins(reg_price_size_energy_year,
                              at = list(floor_size = c(50,
                                                       100,
                                                       150,
                                                       200),
                                        year_built = c(1970,
                                                       1990,
                                                       2000,
                                                       2010),
                                        i.energy_efficiency = c(1:3)))

margins_regression <- mutate(margins_regression, 
                             year_built = factor(year_built,
                                                    levels = c(1970,
                                                               1990,
                                                               2000,
                                                               2010),
                                                    labels = c("< 1970",
                                                               "< 1990",
                                                               "< 2000",
                                                               "< 2013")))
margins_regression <- mutate(margins_regression, # We could not just is it as a factor from the start, since it would change the model 
                             i.energy_efficiency = factor(i.energy_efficiency,
                                                          levels = c(1:3),
                                                          labels= c("Poor",
                                                                    "Mediocre",
                                                                    "Best")))


margins_regression %>% ggplot(mapping = aes(x = floor_size,
                                            y = fitted,
                                            colour = year_built,
                                            linetype = energy_efficiency) ) +
  geom_point() +
  geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

STATA

// Regression flat_price ~ floor_size + energy_efficiency + year_built
reg flat_price floor_size c.energy_efficiency year_built

// Visualizing Regression using margins and marginsplot
margins, at(floor_size = (60(50)220) year_built = (1930(30)2010) energy_efficiency = (1(1)3))
marginsplot
      Source |       SS           df       MS      Number of obs   =        67
-------------+----------------------------------   F(3, 63)        =     32.80
       Model |  2.2832e+12         3  7.6106e+11   Prob > F        =    0.0000
    Residual |  1.4620e+12        63  2.3206e+10   R-squared       =    0.6096
-------------+----------------------------------   Adj R-squared   =    0.5910
       Total |  3.7452e+12        66  5.6745e+10   Root MSE        =    1.5e+05

------------------------------------------------------------------------------
  flat_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  floor_size |   6018.797   615.7774     9.77   0.000     4788.265     7249.33
energy_eff~y |  -74322.56   34854.55    -2.13   0.037    -143973.8     -4671.3
  year_built |   1340.768   1483.367     0.90   0.370    -1623.504    4305.041
       _cons |   -2458863    2993242    -0.82   0.414     -8440380     3522655
------------------------------------------------------------------------------


Adjusted predictions                            Number of obs     =         67
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : floor_size      =          60
               energy_eff~y    =           1
               year_built      =        1930

2._at        : floor_size      =          60
               energy_eff~y    =           1
               year_built      =        1960

3._at        : floor_size      =          60
               energy_eff~y    =           1
               year_built      =        1990

4._at        : floor_size      =          60
               energy_eff~y    =           2
               year_built      =        1930

5._at        : floor_size      =          60
               energy_eff~y    =           2
               year_built      =        1960

6._at        : floor_size      =          60
               energy_eff~y    =           2
               year_built      =        1990

7._at        : floor_size      =          60
               energy_eff~y    =           3
               year_built      =        1930

8._at        : floor_size      =          60
               energy_eff~y    =           3
               year_built      =        1960

9._at        : floor_size      =          60
               energy_eff~y    =           3
               year_built      =        1990

10._at       : floor_size      =         110
               energy_eff~y    =           1
               year_built      =        1930

11._at       : floor_size      =         110
               energy_eff~y    =           1
               year_built      =        1960

12._at       : floor_size      =         110
               energy_eff~y    =           1
               year_built      =        1990

13._at       : floor_size      =         110
               energy_eff~y    =           2
               year_built      =        1930

14._at       : floor_size      =         110
               energy_eff~y    =           2
               year_built      =        1960

15._at       : floor_size      =         110
               energy_eff~y    =           2
               year_built      =        1990

16._at       : floor_size      =         110
               energy_eff~y    =           3
               year_built      =        1930

17._at       : floor_size      =         110
               energy_eff~y    =           3
               year_built      =        1960

18._at       : floor_size      =         110
               energy_eff~y    =           3
               year_built      =        1990

19._at       : floor_size      =         160
               energy_eff~y    =           1
               year_built      =        1930

20._at       : floor_size      =         160
               energy_eff~y    =           1
               year_built      =        1960

21._at       : floor_size      =         160
               energy_eff~y    =           1
               year_built      =        1990

22._at       : floor_size      =         160
               energy_eff~y    =           2
               year_built      =        1930

23._at       : floor_size      =         160
               energy_eff~y    =           2
               year_built      =        1960

24._at       : floor_size      =         160
               energy_eff~y    =           2
               year_built      =        1990

25._at       : floor_size      =         160
               energy_eff~y    =           3
               year_built      =        1930

26._at       : floor_size      =         160
               energy_eff~y    =           3
               year_built      =        1960

27._at       : floor_size      =         160
               energy_eff~y    =           3
               year_built      =        1990

28._at       : floor_size      =         210
               energy_eff~y    =           1
               year_built      =        1930

29._at       : floor_size      =         210
               energy_eff~y    =           1
               year_built      =        1960

30._at       : floor_size      =         210
               energy_eff~y    =           1
               year_built      =        1990

31._at       : floor_size      =         210
               energy_eff~y    =           2
               year_built      =        1930

32._at       : floor_size      =         210
               energy_eff~y    =           2
               year_built      =        1960

33._at       : floor_size      =         210
               energy_eff~y    =           2
               year_built      =        1990

34._at       : floor_size      =         210
               energy_eff~y    =           3
               year_built      =        1930

35._at       : floor_size      =         210
               energy_eff~y    =           3
               year_built      =        1960

36._at       : floor_size      =         210
               energy_eff~y    =           3
               year_built      =        1990

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   415625.4   123336.4     3.37   0.001     169157.2    662093.6
          2  |   455848.5   82387.39     5.53   0.000     291210.4    620486.5
          3  |   496071.5   48209.67    10.29   0.000     399732.2    592410.9
          4  |   341302.9   109000.6     3.13   0.003     123482.5    559123.2
          5  |   381525.9   65953.67     5.78   0.000       249728    513323.8
          6  |     421749    27916.3    15.11   0.000     365962.7    477535.2
          7  |   266980.3   104785.8     2.55   0.013     57582.49    476378.1
          8  |   307203.3   65892.11     4.66   0.000     175528.5    438878.2
          9  |   347426.4   40793.96     8.52   0.000     265906.2    428946.6
         10  |   716565.3   123420.5     5.81   0.000     469929.1    963201.5
         11  |   756788.3   83544.81     9.06   0.000     589837.4    923739.3
         12  |   797011.4   51841.36    15.37   0.000     693414.7    900608.1
         13  |   642242.7   108457.6     5.92   0.000     425507.6    858977.9
         14  |   682465.8   66355.85    10.28   0.000     549864.2    815067.4
         15  |   722688.8   31683.34    22.81   0.000     659374.7    786002.9
         16  |   567920.2   103552.6     5.48   0.000     360986.8    774853.5
         17  |   608143.2   65239.12     9.32   0.000     477773.2    738513.2
         18  |   648366.3   41830.87    15.50   0.000     564773.9    731958.6
         19  |    1017505   130955.3     7.77   0.000     755811.9     1279198
         20  |    1057728   95224.45    11.11   0.000     867437.3     1248019
         21  |    1097951   70333.41    15.61   0.000     957401.2     1238501
         22  |   943182.6   116365.2     8.11   0.000     710645.3     1175720
         23  |   983405.7   79700.82    12.34   0.000     824136.3     1142675
         24  |    1023629   55895.06    18.31   0.000     911931.3     1135326
         25  |     868860   111185.1     7.81   0.000     646674.4     1091046
         26  |   909083.1   77887.26    11.67   0.000     753437.8     1064728
         27  |   949306.1   61085.23    15.54   0.000       827237     1071375
         28  |    1318445   144782.1     9.11   0.000      1029121     1607769
         29  |    1358668   114243.5    11.89   0.000      1130371     1586965
         30  |    1398891    95404.2    14.66   0.000      1208241     1589541
         31  |    1244122   131204.4     9.48   0.000     981931.4     1506314
         32  |    1284346   100981.4    12.72   0.000      1082550     1486141
         33  |    1324569   84502.01    15.67   0.000      1155705     1493432
         34  |    1169800   126083.4     9.28   0.000     917842.4     1421757
         35  |    1210023   98856.54    12.24   0.000      1012474     1407572
         36  |    1250246      87229    14.33   0.000      1075933     1424559
------------------------------------------------------------------------------


  Variables that uniquely identify margins: floor_size year_built
      energy_efficiency

Answers to interpretations (only used to call via cat-function)

interpretation_reg_price_size <- "1. Firstly we can see that the overall model is significant p < .001 \n

2. This means that there is less than 0.1% likelyhood to get the same results (or stronger) given no effect \n

3. Secondly we can see that the coefficient for floor_size is significant both in terms of p-value and confidence interval \n 

4. The coefficient is 5274$, meaning that flat price on average increases with 5274$ per square meter increase in floor size. \n

5. The coefficient is significant p < .001, and CI does not overlap with 0 \n
6. The intercept/constant is 121356 meaning that we predict that a flat with 0 sqm floor size would (on average) cost 121 356 $. \n

7. Lastly we can see that the model (or rather floor_size) explains 60% of the variance in flat_prize"

interpretation_reg_price_size_year <- "1. Firstly we can see that the overall model is significant p < .001 \n

2. This means that there is less than 0.1% likelyhood to get the same results (or stronger) given no effect \n

3. Secondly we can see that the coefficient for floor_size is significant both in terms of p-value and confidence interval \n 

4. We can also see that the model explains roughly 60% of the variance, and about 59% if penalized for adding an extra variable. \n 

5. Notice that this is basicly the same as our previous model, suggesting that year_built (in our dataset) is not a strong predictor of flat price \n

6. The coefficient for floor_size is 5367 (p < .001), meaning that flat price on average increases with 5367$ per square meter increase in floor size, controlled for the effect of year_built \n

7. The coefficient for year_built is 640 (p = .469), then, it seeems that year built is not associated with flat price. If we were to interpet the coefficent however, we could say that flat price (on average) increases by 640$ per one year increase in buildyear. We could then say that the price decreases (on average) by 640$ per year old it is."